\documentclass[10pt,a4paper, onecolumn]{article}
\setlength{\textheight}{257mm}
\setlength{\textwidth}{180mm}
\setlength{\oddsidemargin}{-10mm}
\setlength{\evensidemargin}{-10mm}
\setlength{\topmargin}{-22mm}

% Dutch style of paragraph formatting, i.e. no indents. 
\setlength{\parskip}{1.3ex plus 0.2ex minus 0.2ex}
\setlength{\parindent}{0pt}

\usepackage{seqsplit}
\usepackage{paralist}

\usepackage{hyperref}
\hypersetup{colorlinks,linkcolor=blue,citecolor= MidnightBlue,urlcolor=cyan,menucolor=black}

\usepackage{fancyhdr}
\pagestyle{fancy}
\fancyhf{}
\renewcommand{\headrulewidth}{0pt}
\renewcommand{\footrulewidth}{0pt}

\title{\textbf{Ensemble generation, model run and data visualisation scripts}}
\author{by\\\textbf{Greg Colbourn}\\\\School of Environmental Sciences\\University of East Anglia\\Norwich NR4 7TJ, UK\\\\g.colbourn@uea.ac.uk\\gcolbourn@hotmail.com}

\begin{document}
\maketitle
\thispagestyle{empty}

\section{README for run-scripts to chop GENIE runs into manageable pieces (useful for very long runs), and to generate ensembles and visualise their output using Mathematica}\label{sec:README}

\subsection{Pre-requisites}

\begin{compactitem}
\item The bash shell
\item Grid Engine software (i.e. be on a cluster)
\item Mathematica (for ensemble generation and data processing)
\end{compactitem}

\subsection{Automating splitting long runs with multiple stages into manageable pieces}\label{sec:Automating splitting long runs with multiple stages into manageable pieces}

\begin{enumerate}

\item In your root directory (the directory where the genie directory resides), make sure you have the following:

\begin{compactitem}
\item \texttt{cgenie\_output} [you should already have this].
\item \texttt{cgenie\_log} (to store the terminal output from script runs)
\item \texttt{cgenie\_archive} (to store zipped archives of results) containing directories called \texttt{fresh}, \texttt{to\_resub} and \texttt{to\_clearout}
\item \texttt{results} (to store collated results from Mathematica output in - see below)
\end{compactitem}

\item From the directory genie-main/scripts type:

\texttt{/bin/bash qsub\_cgenie\_myr\_multipart.sh \$1 \$2 \$3 \$4 \$5 \$6 \$7 \$8 \$9}

where

\begin{compactitem}
\item \texttt{\$1} = MODELID: your config name, e.g \texttt{cgenie\_eb\_go\_gs\_ac\_bg\_sg\_rg\_itfclsd\_16l\_JH\_BASE} (corresponding to \texttt{\seqsplit{genie/genie-main/configs/genie\_eb\_go\_gs\_ac\_bg\_sg\_rg\_el.config}}).
\item \texttt{\$2} = BASELINE: your baseline config name - e.g. \texttt{worbe2\_fullCC} (corresponding to \\ \texttt{genie-userconfigs/worbe2\_fullCC}).
\item \texttt{\$3} = RUNID: your run ID, e.g. \texttt{worbe2\_preindustrial\_1}.
\item \texttt{\$4} = NPARTS: the number of parts to the experiment - e.g. \texttt{3} for a 2-stage spin-up, then a main emissions experiment.
\item \texttt{\$5} = K: the experiment part counter, should be \texttt{1} (for spin-up 1).
\item \texttt{\$6} = NEWPART: is it the start of a new part? 0=no; 1=yes (set to 1 initially)
\item \texttt{\$7} = MAXYEARS: your individual job length in years, e.g. \texttt{4000}; means the job is broken into chunks of 4000 years each (this takes $<$5 hours on the UEA ESCluster).
\item \texttt{\$8} = J: your start year, usually \texttt{0}, unless restarting a run. (the script \texttt{resub\_genie\_myr\_multipart.sh} can automate restarts of ensembles - see below).
\item \texttt{\$9} = MINJOBTIME: the crash tolerance - min number of seconds allowed between resubmits before job is killed (\texttt{10} is good for testing, \texttt{60} for actual runs).
\end{compactitem}

Our example looks like this:

\texttt{/bin/bash qsub\_cgenie\_myr\_multipart.sh cgenie\_eb\_go\_gs\_ac\_bg\_sg\_rg\_itfclsd\_16l\_JH\_BASE \seqsplit{worbe2\_fullCC} ensemble\_01\_11 3 1 1 4400 0 60}

For a more simple test, you might want to try something like:

\texttt{/bin/bash qsub\_cgenie\_myr\_multipart.sh cgenie\_eb\_go\_gs\_ac\_bg\_sg\_rg\_itfclsd\_16l\_JH\_BASE \seqsplit{rokgem\_test} 3 1 1 40 0 10}

which will be quick to run, as by default the lengths of the runs for each stage are set to 6, 11 and 11. (Note that you will need to have the file rokgem\_test in the genie-userconfigs directory - there is an example file in the genie-tools/runscripts directory that you can use for this purpose).

For your BioGeM, SedGeM and ENTS output not to be overwritten each time the model is restarted you need to set \texttt{bg\_opt\_append\_data=.TRUE.}, \texttt{sg\_ctrl\_append\_data=.TRUE.} and \texttt{ents\_opt\_append\_data=.TRUE.} in your config file (as is set in \texttt{rokgem\_test; rg\_opt\_append\_data=.TRUE.} is set by default)

Note that in \texttt{qsub\_rungenie\_test} there are pauses (\texttt{sleep 2}) in between job submits. These are just as a precaution so as not to overload the cluster. I put them in because I was finding that files weren't getting written in time and things were getting lost.

\item To run multiple experiments at once, edit the script \texttt{qsub\_rungenie\_test}, typing out lines of the above form for each run. Type \texttt{./qsub\_rungenie\_test} in the terminal to run the script.

The way the scripting works is as follows: \texttt{\seqsplit{qsub\_cgenie\_myr\_multipart.sh}} is used to submit a script \texttt{\seqsplit{cgenie\_myr\_multipart.sh}} to the job queue on the cluster. \texttt{\seqsplit{cgenie\_myr\_multipart.sh}} selects the appropriate parameters by loading them in from files \texttt{\seqsplit{genie/genie-main/configs/\$1}}, \texttt{\seqsplit{genie/genie-configpatches/\$2\_0}} and \texttt{\seqsplit{genie/genie-configpatches/\$2\_\$5}} (e.g.\texttt{\seqsplit{genie/genie-main/configs/cgenie\_eb\_go\_gs\_ac\_bg\_sg\_rg\_itfclsd\_16l\_JH\_BASE.config}}, \texttt{\seqsplit{genie-userconfigs/worbe2\_fullCC\_0}} and \texttt{\seqsplit{genie-userconfigs/worbe2\_fullCC\_2}} for the second part of a full carbon cycle spin-up - see \texttt{\seqsplit{genie-main/doc/genie-howto}}: \S5.1). \texttt{\$2\_0} contains parameters common to all parts of the experiment. The script \texttt{\seqsplit{genie/genie-main/configs/\$2\_\$5.sh}} is also run, which automates tasks that need doing when progressing from one part of an experiment to the next (e.g. taking sediment burial from a spin-up and inputting it as weathering flux for an experiment); It is in this script that the length of the experiment part in years is set with RUNLENGTH. \texttt{\seqsplit{cgenie\_myr\_multipart.sh}} then calls \texttt{runcgenie.sh}, which sets time-stepping and restart locations and launches the job with \texttt{\seqsplit{genie-main/old\_genie\_example.job}}. After the job has completed, \texttt{\seqsplit{cgenie\_myr\_multipart.sh}} updates the run year and/or part of experiment (and whether it is a new part or not) appropriately and resubmits itself to the job queue. It terminates when when all parts are finished, or if the time between job submissions is less than the crash tolerance (\texttt{\$8}). Output is written to file in \texttt{\seqsplit{cgenie\_log}}. When an experiment is completed, the script \texttt{\seqsplit{qsub\_sort\_runlog.sh}} is generated, which (via calling \texttt{\seqsplit{sort\_runlog\_date.sh}}) sorts files in \texttt{\seqsplit{cgenie\_log}} into date order (this can be useful when thousands of files are being produced). \texttt{\seqsplit{sort\_runlog\_date.sh}} generates and runs the script \texttt{\seqsplit{zz\_sort\_runlog\_exe.sh}} in place in the directory that it is sorting. The file \texttt{qsub\_base.sh} is used as a basis for generating some of the \texttt{.sh} files that are run; it contains a header that specifies which queue to use on the cluster (medium), and where to direct output (\texttt{\seqsplit{cgenie\_log}}).

Note that in script \texttt{cgenie\_myr\_multipart.sh} requires that \texttt{sge}, \texttt{pgi} and \texttt{netcdf} modules are loaded onto the node being used. The lines 

\texttt{./etc/profile.d/modules.sh \\
module add shared sge \\
module add pgi/7.0.7 \\
module add netcdf-4.0}

will need to be edited to corresponding local equivalents if running elsewhere then UEA [the first two also used in \texttt{resub\_cgenie\_myr\_multipart.sh}].

\end{enumerate}

To automatically produce ensembles (using Mathematica), see \S\ref{sec:Ensemble generation and data visualisation (using Mathematica)} below.

\subsection{Name shortening}

The names for genie runs can get a bit unwieldy with all the \texttt{\seqsplit{eb\_go\_gs\_ac\_bg\_sg\_rg\_el}} etc combined with the \texttt{configpatches}. This can especially be a problem when extracting from zip files. I've found that Mathematica won't do it if the path length to the \texttt{.tar.gz} file, including the name of the file, is more than 100 characters. Thus I try to reduce the names of the archive files, whilst preserving enough information to both describe and differentiate them. In the above scripts, the script \texttt{nameshortening.sh} is called, which contains a number of abbreviations implemented via \texttt{sed}. Add to these as you see fit. The first argument of \texttt{nameshortening.sh} is the name to be shortened, the second argument is the type of shortening (in \texttt{nameshortening.sh} there are two different versions of shortening, one for output directory names, and one for script names). 


\subsection{Restarting runs}

\begin{enumerate}

\item Place the archive directories of the runs you want restarted (which will be in \texttt{\seqsplit{cgenie\_archive}}) in \texttt{\seqsplit{cgenie\_archive/to\_resub}}.

\item From the directory genie-main/scripts type:

\texttt{/bin/bash resub\_cgenie\_myr\_multipart.sh \$1 \$2 \$3}

where \texttt{\$1}, \texttt{\$2}, \texttt{\$3} are \texttt{\$4}, \texttt{\$7} and \texttt{\$9} above i.e.

\begin{compactitem}
\item \texttt{\$1} = NPARTS: the number of parts to the experiment - e.g. \texttt{3} for a 2-stage spin-up, then a main emissions experiment.
\item \texttt{\$2} = MAXYEARS: your individual job length in years, e.g. \texttt{4000}; means the job is broken into chunks of 4000 years each (this takes $<$5 hours on the UEA ESCluster).
\item \texttt{\$3} = MINJOBTIME: the crash tolerance - min number of seconds allowed between resubmits before job is killed (\texttt{10} is good for testing, \texttt{60} for actual runs).
\end{compactitem}

\end{enumerate}

The other arguments for \texttt{\seqsplit{qsub\_cgenie\_myr\_multipart.sh}} (see above), which is resubmitted by a script \texttt{\seqsplit{resub\_cgenie\_myr\_multipart\_exe.sh}} created by \texttt{\seqsplit{resub\_cgenie\_myr\_multipart.sh}}, are obtained as follows: MODELID, BASELINE, and RUNID are worked out from the archive directory name; and K and J are worked out from the name of the last modified \texttt{.tar.gz} file in the archive directory.

This process of resubmission is automated for ensembles with the Mathematica script \texttt{\seqsplit{multi\_ensemble\_rgexpt.nb}} by checking the \texttt{qstat}. See part 4 below.


\subsection{Ensemble generation and data visualisation (using Mathematica)}\label{sec:Ensemble generation and data visualisation (using Mathematica)}

[Note: if you don't have access to Mathematica, you can get a free trial at \href{http://www.wolfram.com/products/mathematica/experience/request.cgi}{\seqsplit{www.wolfram.com/products/mathematica/experience/request.cgi}}. If you've not used mathematica before, execute cells by having the cursor within them (or highlighting the cell on the right) and pressing Enter or Shift-Return; double click the cells at the side to expand and contract them. Code generally consists of nested functions (Capitalised) using [ ] brackets, (* *) denote comments, \{ \} lists, ( ) arithmetic brackets, and [[ ]] array parts. There is comprehensive documentation at \href{http://reference.wolfram.com/mathematica/guide/Mathematica.html}{\seqsplit{http://reference.wolfram.com/mathematica/guide/Mathematica.html}}. Ideally, I'd like to re-do this in something open source, like SAGE, but probably won't get round to it.]

\begin{enumerate}

\item To automatically generate ensemble(s) (of non-xml config files), start by editing your base config \texttt{.csv} file(s) (which should be in \texttt{\seqsplit{genie-userconfigs}}) -  examples are given in \texttt{\seqsplit{cgenie/genie-userconfigs}}. Put commas between different values of parameters that you want to loop over in the ensemble (there should be no other commas), \texttt{\&}s between parameters that you want to group together for the purposes of the ensemble, \texttt{!}s at the start of lines containing parameters not to be looped over and \texttt{\%}s at the start of lines giving names for output (see \texttt{\seqsplit{multi\_ensemble\_params.nb}}.

\item Edit the file \texttt{\seqsplit{multi\_ensemble\_params.nb}} (which can be run from anywhere, but you might want to save a renamed copy to the \texttt{\seqsplit{genie-userconfigs}} folder). An explanation of the parameters is provided in the file.

\item Execute the contents of the above \texttt{.nb} file and then \S ``\textbf{Set-Up For Run:}"  in \texttt{\seqsplit{multi\_ensemble\_rgexpt.nb}}. A file \texttt{\seqsplit{qsub\_rungenie\_expt\_XXXXXXXXXX}} (where the \texttt{X}s represent a date string, e.g. 201102150930 for 9.30 a.m. on the 15th of February 2011) should now be in the folder \texttt{\seqsplit{cgenie\_log/expts}}. This is basically the same as \texttt{\seqsplit{qsub\_rungenie\_test}} as described in the part 3 of \S\ref{sec:Automating splitting long runs with multiple stages into manageable pieces} above. Execute this file to set the ensemble(s) going (you'll probably need to chmod first - i.e. \texttt{chmod 755 \seqsplit{qsub\_rungenie\_expt\_XXXXXXXXXX}} at the terminal prompt. Then \texttt{\seqsplit{./qsub\_rungenie\_expt\_XXXXXXXXXX}}). If you accidentally set an ensemble going with lots of runs, you can use the \texttt{script genie-main/scripts/qdel\_ensemble.sh} to stop them; supply it with an argument that is part of the ensemble name. i.e. type \texttt{\seqsplit{./qdel\_ensemble} e01} (it uses \texttt{grep}, so this part should be unique otherwise other jobs may be killed - type \texttt{qsub | grep e01} to test first - notice that this particular example picks up jobs running on nodes starting with \texttt{01}!).

\item To check on your runs and resubmit those that have crashed, in \texttt{\seqsplit{multi\_ensemble\_params.nb}} set \texttt{runmode} to \texttt{2}, then execute \texttt{\seqsplit{multi\_ensemble\_params.nb}} and ``\textbf{Set-Up For Run:}" in \texttt{\seqsplit{multi\_ensemble\_rgexpt.nb}} (you'll also need to set \texttt{qstat}, \texttt{qstatname}, \texttt{nrunlogstolookat} and \texttt{ncharsperrunlogtooutput} - see \texttt{\seqsplit{multi\_ensemble\_params.nb}} for details). Hopefully, on screen in the Mathematica notebook will be displayed the bottom of the most recent run logs - which should give you an idea of why runs have crashed, if any have. If runs have crashed, a script will be generated in \texttt{\seqsplit{cgenie\_log/expts}} called \texttt{\seqsplit{deadjobs\_XXXXXXXXXX}} (where the \texttt{X}s represent a date string); run this script to automatically resubmit crashed runs. Note that if you want to start runs from completely afresh, it's best to delete their directories first in \texttt{genie\_output} and \texttt{genie\_archive}. Leaving these in place can interfere with the processing of this part of the script.

\item Once runs are finished, or on their way, execute ``\textbf{Collate Results:}"  to import data from time-series files and export to \texttt{.xls} files (i.e. collate the ensemble time-series into spreadsheets). Results (and any graphs created by setting \texttt{saveallgraphs}=1) will be saved in folders in the \texttt{results} folder corresponding to ensemble member names, experiment parts and date produced (if \texttt{sameoutdir} = \texttt{0} - see \texttt{multi\_ensemble\_params.nb}). Options (\texttt{divisions}, \texttt{saveallgraphs}, \texttt{timescaleanalysis}, \texttt{\seqsplit{timescalefitting}}, \texttt{savenetcdfmovies} - see \texttt{\seqsplit{multi\_ensemble\_params.nb}}) allow different combinations of runs to be collated, graphs to be plotted and saved, timescale analysis to be performed (specific to emissions runs where CO$_2$ is falling), tables to be created and exported along with graphs to LaTeX form, and animated gifs of various forms to be produced from netcdf output.

\item Execute sections ``\textbf{a. Set-Up}" and ``\textbf{b. Interactive time-series plotting}" in the section ``\textbf{Interactive Plotting:}" to get an interactive time-series plotting widget with a fairly self-explanatory GUI. The tables displayed below the graph will be of use in picking which lines (ensemble members) to display (tab). To save a graph, check and then uncheck the "save graph" box (leaving it checked will produce a new file every minute). ``\textbf{c. Interactive timescale analysis plotting}" is similar, but with timescale analysis (\textit{e}-folding curves) for global average pCO$_2$ and surface temperature.

\item For interactive netcdf output, see section ``\textbf{d. Net-CDF}". Execute section ``\textbf{i. Set-up}" and then have a play with the other sub-sections, which get more complicated (and computer resource intensive) as they advance. Each one produces a GUI with geographic output that should be fairly intuitive. Note that if there are more (or less) than 2 ensemble variables, then the code needs to be altered where the grey commented out bits are. For example:

\texttt{\{\{v1, 1, vartitles[[vs[[1]]]]\},Table[i -> varvalues[[vs[[1]], i]], \{i, nvars[[vs[[1]]]]\}]\}, \\
(*\{\{v2,1,vartitles[[vs[[2]]]]\},Table[i->varvalues[[vs[[2]],i]],\{i,nvars[[vs[[2]]]]\}]\}, \\
\{\{v3,1,vartitles[[vs[[3]]]]\},Table[i->varvalues[[vs[[3]],i]],\{i,nvars[[vs[[3]]]]\}]\},*)}

would be adjusted to:

\texttt{\{\{v1, 1, vartitles[[vs[[1]]]]\},Table[i -> varvalues[[vs[[1]], i]], \{i, nvars[[vs[[1]]]]\}]\}, \\
\{\{v2,1,vartitles[[vs[[2]]]]\},Table[i->varvalues[[vs[[2]],i]],\{i,nvars[[vs[[2]]]]\}]\}, \\
(*\{\{v3,1,vartitles[[vs[[3]]]]\},Table[i->varvalues[[vs[[3]],i]],\{i,nvars[[vs[[3]]]]\}]\},*)}

in the case of there being 3 variables. This should really be automated as it is with the Do loops in the ``\textbf{Collate Results:}" section, but I haven't managed to get it to work.

\end{enumerate}


\end{document}